This notebook contains code and output from exploring the data, used to inform the cleaning steps and the analysis process.
library(tidyverse)
library(ggplot2)
library(lubridate)
library(GGally) # for ggpairs to explore correlations
library(psych) # for pairplots to explore correlations
Aug 21
Dataset 1, from https://www.kaggle.com/datasets/emmanuelfwerr/london-homes-energy-data.
daily_energy <- read_csv("../3_raw_data/london_energy.csv") %>%
janitor::clean_names()
Rows: 3510433 Columns: 3
── Column specification ─────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): LCLid
dbl (1): KWH
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(daily_energy)
── Data Summary ────────────────────────
Values
Name daily_energy
Number of rows 3510433
Number of columns 3
_______________________
Column type frequency:
character 1
Date 1
numeric 1
________________________
Group variables None
3510433 rows x 3 cols
Each row is one household's consumption on one date: 5,566 households over 829 days (2011-11-23 to 2014-02-28).
No missing values in the Kaggle-aggregated energy data.
Aug 23
BUT there are effectively missing values, because only households with a reading on a given date are included, and not every household has readings for every date in the range. See the section below on CLEANING household data.
# check for duplicates -- NONE
daily_energy %>%
group_by(lc_lid, date) %>%
summarise(count = n()) %>%
filter(count > 1)
`summarise()` has grouped output by 'lc_lid'. You can override using the `.groups` argument.
No duplicates.
Aug 21
Not sure the mean is an appropriate measure here; let's look at the distribution of energy consumption values across all households on a single date:
# boxplot on 2012-12-31
daily_energy %>%
filter(date == "2012-12-31") %>%
ggplot(aes(x = kwh)) +
geom_boxplot()
# histogram on 2012-12-31
daily_energy %>%
filter(date == "2012-12-31") %>%
ggplot() +
geom_histogram(aes(x = kwh), bins = 150)
Energy consumption across households on a single date is not normally distributed but right-skewed. Median may be a better average daily measure than mean.
Aug 21
# median
daily_energy %>%
group_by(date) %>%
mutate(median_kwh = median(kwh)) %>%
ggplot(aes(x = date, y = median_kwh)) +
geom_line()
daily_energy %>%
group_by(date) %>%
mutate(mean_kwh = mean(kwh)) %>%
ggplot(aes(x = date, y = mean_kwh)) +
geom_line()
Definitely some seasonality in energy consumption, whichever average is used.
August 22
See classnotes wk7 day 1 (new notes in codeclan_work/flipped… folder)
- use package slider
First need to convert data to tsibble:
library(tsibble)
library(slider)
# convert data to tsibble
daily_energy_ts <- as_tsibble(daily_energy, key = lc_lid, index = date)
index_var(daily_energy_ts)
key_vars(daily_energy_ts)
To make a rolling average for the whole dataset, first find the median of all values on each date (because values on a single date are not normally distributed but skewed, see above), then take the mean across a 14-day window (set as 7 before, 6 after, only shown for the complete range).
# make df with median for each day
median_all_daily_energy_ts <- daily_energy_ts %>%
index_by(date) %>%
summarise(median_kwh = median(kwh))
# calc rolling 14-day mean
median_all_daily_energy_ts <- median_all_daily_energy_ts %>%
mutate(kwh_moving_avg = slide_dbl(
.x = median_kwh,
.f = ~ mean(.),
.before = 7,
.after = 6, # 14-day rolling average
.complete = TRUE
))
median_all_daily_energy_ts %>%
ggplot() +
geom_line(aes(x = date, y = median_kwh), colour = "grey") +
geom_line(aes(x = date, y = kwh_moving_avg), colour = "red") +
theme_minimal()
Need to look at seasonal decomposition.
Aug 22
library(feasts)
Use the median values
median_all_daily_energy_ts %>%
autoplot(median_kwh) +
xlab("Date") + ylab("Median daily energy consumption (kWh)")
Interpret:
# look at seasonal pattern
median_all_daily_energy_ts %>%
gg_season(median_kwh)
Seasonal pattern is consistent year-on-year. Maybe 2014 looks like slightly lower usage? Also note some higher variability in winter 2011 data. Need to remove last date because values are off? And first date too?
Look at subseries:
Can see trend for every weekday - does look like Sat & Sun are higher.
median_all_daily_energy_ts %>%
gg_subseries(y = median_kwh, period = 7)
Old code below didn't work; not sure the question is right, so pausing this for now:
daily_energy_ts %>%
mutate(quarter = quarter(date, type = "date_first", fiscal_start = 12)) %>%
arrange(quarter) %>%
index_by(lc_lid) %>%
summarise(median_kwh = median(kwh))
Error in `validate_index()`:
! Unsupported index type: character
Backtrace:
1. ... %>% summarise(median_kwh = median(kwh))
4. tsibble:::index_by.tbl_ts(., lc_lid)
5. tsibble::build_tsibble(...)
6. tsibble:::validate_index(tbl, !!index2)
When looking for subseries, first change the index to season (using quarter) - this doesn't work either (the key/index rows are no longer distinct); subseries needs the index in date format.
quarterly_energy_ts <- daily_energy_ts %>%
# make quarters where Dec is in Q1 of year (adjusted as season not real quarter)
mutate(quarter = quarter(date, type = "date_first", fiscal_start = 12)) %>%
update_tsibble(key = lc_lid, index = quarter)
Error in `validate_tsibble()`:
! A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.
Backtrace:
1. ... %>% update_tsibble(key = lc_lid, index = quarter)
2. tsibble::update_tsibble(., key = lc_lid, index = quarter)
3. tsibble::build_tsibble(...)
4. tsibble:::validate_tsibble(data = tbl, key = key_vars, key_data = key_data, index = index)
Aug 21
Let’s look at a few individual households:
daily_energy %>%
distinct(lc_lid)
daily_energy %>%
filter(lc_lid %in% c("MAC000002", "MAC000058", "MAC000359", "MAC000714", "MAC000933", "MAC000997")) %>%
ggplot(aes(x = date, y = kwh, colour = lc_lid)) +
geom_line() +
facet_wrap(~lc_lid)
Can already see households are behaving differently, some high consumers, others less so; some change over the year, some less so.
Might want to use a rolling average here, e.g. 7-day, to give a smoother line (as above!)
Might also want to consider if there are “types” of household to split the data by, e.g. according to how variable they are in a year, their level of consumption (high/low) - see more below.
Aug 21
Look to see if there are obvious high/medium/low consumer groups:
household_alltime_consumption <- daily_energy %>%
group_by(lc_lid) %>%
summarise(alltime_median_kwh = median(kwh),
alltime_mean_kwh = mean(kwh))
household_alltime_consumption
household_alltime_consumption %>%
ggplot() +
geom_histogram(aes(x = alltime_median_kwh), bins = 200)
Look at January or July dates only, when there may be more variation in usage (cold and hot weather):
household_jan_jul_consumption <- daily_energy %>%
mutate(month = month(date, label = TRUE, abbr = FALSE)) %>%
filter(month %in% c("January","July")) %>%
group_by(lc_lid, month) %>%
summarise(median_kwh = median(kwh),
mean_kwh = mean(kwh))
`summarise()` has grouped output by 'lc_lid'. You can override using the `.groups` argument.
# january medians
household_jan_jul_consumption %>%
filter(month == "January") %>%
ggplot() +
geom_histogram(aes(x = median_kwh), bins = 200)
# july medians
household_jan_jul_consumption %>%
filter(month == "July") %>%
ggplot() +
geom_histogram(aes(x = median_kwh), bins = 200)
Most households have median daily consumption in the range 0–30 kWh in January and 0–20 kWh in July, indicating energy usage is highest in winter.
household_jan_jul_consumption %>%
ggplot() +
geom_boxplot(aes(x = month, y = median_kwh))
Maybe energy consumption in January is slightly more than in July. Could test this. Long tail of high energy consumers.
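"Could test this" - a minimal sketch of one way, assuming the `household_jan_jul_consumption` summary above. The distributions are right-skewed, so a non-parametric test seems safer than a t-test; a paired version (matching households present in both months) would be stronger still.

```r
# Hypothetical test: are January household medians higher than July's?
jan_kwh <- household_jan_jul_consumption %>%
  filter(month == "January") %>%
  pull(median_kwh)
jul_kwh <- household_jan_jul_consumption %>%
  filter(month == "July") %>%
  pull(median_kwh)

# one-sided Wilcoxon rank-sum test (unpaired, for simplicity)
wilcox.test(jan_kwh, jul_kwh, alternative = "greater")
```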
No obvious groups to split into, but could label households above Q3 as “high consumers”, below Q1 as “low consumers”, and rest as “medium consumers”.
all_kwh_stats <- daily_energy %>%
select(kwh) %>%
skimr::skim()
all_kwh_stats %>%
colnames()
[1] "skim_type" "skim_variable" "n_missing" "complete_rate" "numeric.mean"
[6] "numeric.sd" "numeric.p0" "numeric.p25" "numeric.p50" "numeric.p75"
[11] "numeric.p100" "numeric.hist"
kwh_q1 <- all_kwh_stats %>%
select(numeric.p25) %>%
pull()
kwh_q3 <- all_kwh_stats %>%
select(numeric.p75) %>%
pull()
kwh_q1
[1] 4.685
kwh_q3
[1] 12.576
Note these quartiles are calculated from all dates, not adjusted for Jan/July/seasonality; also note the date range covers more winters (3) than summers (2).
# add household type, can be used as lookup key-value (join with full data)
household_alltime_consumption <- household_alltime_consumption %>%
mutate(household_consumption_level = case_when(
alltime_median_kwh >= kwh_q3 ~ "high",
alltime_median_kwh > kwh_q1 ~ "average",
alltime_median_kwh <= kwh_q1 ~ "low",
.default = NA_character_
),
household_consumption_level = factor(household_consumption_level, levels = c("low", "average", "high")))
household_alltime_consumption %>%
ggplot() +
geom_boxplot(aes(x = household_consumption_level, y = alltime_median_kwh))
household_alltime_consumption %>%
summarise(count = n(), .by = household_consumption_level)
Seems like a potentially useful split of data
daily_energy_processed <- left_join(daily_energy, household_alltime_consumption, by = "lc_lid")
daily_energy_processed
Look at median energy over time by type:
daily_energy_processed_medians <- daily_energy_processed %>%
group_by(date, household_consumption_level) %>%
summarise(median_kwh = median(kwh))
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
daily_energy_processed_medians %>%
arrange(date)
daily_energy_processed_medians %>%
ggplot() +
geom_line(aes(x = date, y = median_kwh, colour = household_consumption_level))
There may be a way to group households by consumption in an unsupervised way, i.e. by clustering instead. TBC.
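As a first sketch of that clustering idea, k-means on the log of each household's all-time median, assuming the `household_alltime_consumption` table above. k = 3 is my assumption, chosen only to mirror the low/average/high split, not because 3 is known to be right.

```r
set.seed(42) # k-means is stochastic

clustered_hholds <- household_alltime_consumption %>%
  filter(alltime_median_kwh > 0) %>%
  # log scale because the medians are right-skewed
  mutate(log_median_kwh = log(alltime_median_kwh),
         cluster = factor(kmeans(scale(log_median_kwh),
                                 centers = 3, nstart = 25)$cluster))

clustered_hholds %>%
  ggplot() +
  geom_histogram(aes(x = alltime_median_kwh, fill = cluster), bins = 200)
```

Could compare the cluster boundaries with the quartile-based split below.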
Aug 23
# find households with the 10 lowest kwh scores
daily_energy_processed %>%
arrange(kwh) %>%
head(10)
Arranging by kwh low to high shows lots of zeros – investigate:
hholds_zero_values <- daily_energy_processed %>%
filter(kwh == 0) %>% # 15,168 rows
summarise(count = n(), .by = lc_lid) %>%
arrange(desc(count))
hholds_zero_values
308 households have at least one zero value.
hholds_zero_values %>%
ggplot() +
geom_col(aes(x = lc_lid, y = count))
hholds_with_zeros <- hholds_zero_values %>%
pull(lc_lid)
daily_energy_processed %>%
filter(lc_lid %in% hholds_with_zeros) %>%
mutate(zero_kwh_value = if_else(kwh == 0, T, F), .before = kwh) %>%
ggplot() +
geom_line(aes(x = date, y = zero_kwh_value, group = lc_lid), alpha = 0.2, colour = "grey")
daily_energy_processed %>%
filter(lc_lid %in% hholds_with_zeros) %>%
mutate(zero_kwh_value = if_else(kwh == 0, T, F), .before = kwh) %>%
ggplot() +
geom_point(aes(x = date, y = zero_kwh_value), colour = "grey", alpha = 0.1)
Maybe some households weren’t collecting data / didn’t join until later in the period.
daily_energy_processed %>%
filter(date > "2013-01-01" & kwh == 0) # still 10,500 zeroes
daily_energy_processed %>%
distinct(date)
829 days
Note that no household is all zero values, though; the highest number of zero days is 789.
Could drop households that have a zero value on more than 40% of days in the time range, since we are interested in estimating/understanding energy usage, and 0 kWh indicates something else is going on here.
0.4*829
So: remove the 10 households with >= 331 zero values (i.e. roughly 40% or more of days with 0 kWh):
missing_data_hholds <- daily_energy_processed %>%
filter(kwh == 0) %>% # 15,168 rows
summarise(count = n(), .by = lc_lid) %>%
filter(count >= 331) %>%
pull(lc_lid)
missing_data_hholds
Could impute with each household's median kWh value for that month and weekday/weekend type, OR could drop these households (remember we have ~5,500!).
Ideally, try both approaches and see which gives the better result.
So daily_energy_trim has 10 fewer households (5,556).
Check weekdays:
- 28 Feb 2014 was a Friday - yes
- 12 Oct 2012 was also a Friday - yes
- check yearseason arranges in correct order - yes
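The chunk that actually builds daily_energy_trim isn't shown above; a sketch of what it likely contains, based on the columns used in the imputation below (the names weekday, is_weekend, and yearseason are assumed from later chunks, and the original probably also converts yearseason to an ordered factor as in the seasons chunk further down):

```r
# Sketch (assumed, not the original chunk): drop the 10 high-zero households and
# add the date-derived columns checked above and used in the imputation below.
daily_energy_trim <- daily_energy_processed %>%
  filter(!lc_lid %in% missing_data_hholds) %>%
  mutate(weekday = wday(date, label = TRUE), .after = date) %>%
  mutate(is_weekend = weekday %in% c("Sat", "Sun"), .after = weekday) %>%
  # season quarters where Dec starts winter
  mutate(quarter = quarter(date, type = "date_first", fiscal_start = 12), .after = date) %>%
  mutate(yearseason = paste(
    case_when(month(quarter) == 12 ~ "Winter",
              month(quarter) == 3  ~ "Spring",
              month(quarter) == 6  ~ "Summer",
              month(quarter) == 9  ~ "Autumn"),
    year(quarter)
  ), .after = quarter)
```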
daily_energy_trim %>%
filter(kwh == 0) # now 9,633 zero values
# impute missing kwh values with median value for that hhold that yearseason and wday type
daily_energy_imputed <- daily_energy_trim %>%
mutate(zero_kwh_value = if_else(kwh == 0, T, F), .before = kwh) %>%
group_by(lc_lid, yearseason, is_weekend) %>%
mutate(hh_avg_season_wday = mean(kwh), .after = kwh) %>%
ungroup() %>%
mutate(kwh_imputed = if_else(kwh == 0, hh_avg_season_wday, kwh), .before = kwh)
daily_energy_imputed %>%
filter(zero_kwh_value == T) # 9,633 values needed imputing
daily_energy_imputed %>%
filter(zero_kwh_value == T & kwh_imputed == 0)
# using median: 4,648 values remain
# using mean: 807 zero values remain
## before removing the 10 hholds with 40%+ zero days
# 15,168 values were zero
# using median: 10,037 zero values remain after trying to impute with median for wday & season per hhold
# using mean: 3,195 zero values remain - use mean!
So daily_energy_imputed cleaning steps:
- remove 10 households with 40%+ of days at zero kWh – these are unusual cases, maybe indicating they get energy from elsewhere, the properties aren't lived in much, or something isn't working with their smart meter
- impute remaining zeroes using the mean kWh for that household in that season and according to weekday/weekend type
- 807 zero values remain in the data; leave them in, a low % of the total 3,503,867 observations (0.02%)
100*807/nrow(daily_energy_imputed)
What does the data look like now for a household with some zeroes imputed?
"MAC002050" had 316 zeroes, so was not removed; "MAC002072" had 303.
# check households with lots of zero values MAC000197, MAC004067, MAC000037
daily_energy_imputed %>%
filter(lc_lid %in% c("MAC002050", "MAC002072")) %>%
ggplot(aes(x = date, group = lc_lid)) +
geom_line(aes(y = kwh), colour = "blue", linetype = 1) +
geom_line(aes(y = kwh_imputed), colour = "red", linetype = 1) +
facet_wrap(~ lc_lid, ncol = 1)
Actually, there are whole periods of zeroes – these should also be removed
hholds_zero_values %>%
filter(!lc_lid %in% missing_data_hholds) # filter for hholds not already removed
e.g. MAC003707, MAC002771 have 100 zeroes; MAC001655 has 78 zeroes; MAC004756 has 41
daily_energy_imputed %>%
filter(lc_lid %in% c("MAC003707", "MAC002771", "MAC001655", "MAC004756")) %>%
ggplot(aes(x = date, group = lc_lid)) +
geom_line(aes(y = kwh), colour = "blue", linetype = 1) +
geom_line(aes(y = kwh_imputed), colour = "red", linetype = 1) +
facet_wrap(~ lc_lid, ncol = 1)
Check whether each household has the full 829 days:
daily_energy_processed %>%
group_by(lc_lid) %>%
summarise(count = n()) %>%
filter(count < 829)
5,555 of the 5,566 households do not have complete data for the whole time period! i.e. only 11 households have data for the whole time.
What is the threshold at which I don’t include a household?
Reasonably, a year’s worth of data is sufficient for modelling weather and energy interactions…
hholds_lt1y <- daily_energy_processed %>%
group_by(lc_lid) %>%
summarise(count = n()) %>%
filter(count < 365) %>%
pull(lc_lid)
hholds_lt1y
182 households have less than 365 days’ worth of data, in the 829-day period.
daily_energy_processed %>%
filter(!lc_lid %in% hholds_lt1y) %>%
filter(kwh == 0) %>% # 14,019 zero kWh values
summarise(num_zeros = n(), .by = lc_lid) %>%
arrange(desc(num_zeros)) # 246 hholds have zero values
Repeat the cleaning steps above
hholds_zero_values2 <- daily_energy_processed %>%
filter(!lc_lid %in% hholds_lt1y) %>%
filter(kwh == 0) %>% # 14,019 zero kWh values
summarise(num_zeros = n(), .by = lc_lid) %>%
arrange(desc(num_zeros))
hholds_zero_values2
308 households have zero values; 246 remain if the households with less than 365 days' data are removed.
hholds_zero_values2 %>%
filter(num_zeros <= 30)
164 hholds have 30 or fewer zero values
total_days = as.numeric(max(daily_energy$date) - min(daily_energy$date) + 1)
hhold_stats <- daily_energy_processed %>%
  mutate(zero_kwh = if_else(kwh == 0, T, F)) %>%
  group_by(lc_lid) %>%
  summarise(num_values = n(),
            num_zeros = sum(zero_kwh),
            min_value = min(kwh),
            max_value = max(kwh),
            # max - min, not range(): range() returns two values,
            # which duplicated every summarise row
            range = max(kwh) - min(kwh),
            median_kwh = median(kwh),
            iqr = IQR(kwh),
            mean_kwh = mean(kwh),
            sd = sd(kwh)
  ) %>%
  mutate(pc_missing = (100 * (total_days - num_values) / total_days), .after = lc_lid) %>%
  mutate(pc_zeros = (100 * num_zeros / num_values), .after = pc_missing) %>%
  ungroup()
hhold_stats %>%
arrange(desc(pc_missing))
hhold_stats %>%
arrange(desc(pc_zeros))
hhold_stats %>%
ggplot() +
geom_point(aes(x = pc_missing, y = pc_zeros))
hhold_stats %>%
ggplot() +
geom_histogram(aes(x = pc_missing), bins = 100)
hhold_stats %>%
ggplot() +
geom_histogram(aes(x = pc_zeros), bins = 100)
hhold_stats %>%
filter(pc_zeros == 100)
hhold_stats %>%
filter(pc_zeros > 10)
73 households have > 10% of their values at 0 kWh
0 kWh could be real, e.g. on holiday/out of the house with everything switched off, or households with solar panels using only those on sunny days (could look at high-sun, low-energy days).
But overall, lots of 0 values is not useful for modelling change in energy usage, and most households don't have so many zeroes - so just remove these households; can't explain them right now.
hholds_10pc_zero <- hhold_stats %>%
filter(pc_zeros > 10) %>%
pull(lc_lid)
hholds_10pc_zero
[1] "MAC000037" "MAC000134" "MAC000197" "MAC000242" "MAC000287"
[6] "MAC000399" "MAC000418" "MAC000636" "MAC000895" "MAC000912"
[11] "MAC001051" "MAC001082" "MAC001101" "MAC001150" "MAC001181"
[16] "MAC001309" "MAC001311" "MAC001340" "MAC001348" "MAC001456"
[21] "MAC001629" "MAC001654" "MAC001655" "MAC001777" "MAC001898"
[26] "MAC001921" "MAC001984" "MAC002050" "MAC002072" "MAC002096"
[31] "MAC002110" "MAC002136" "MAC002224" "MAC002374" "MAC002388"
[36] "MAC002465" "MAC002564" "MAC002594" "MAC002612" "MAC002771"
[41] "MAC002842" "MAC002863" "MAC002873" "MAC002976" "MAC003125"
[46] "MAC003156" "MAC003207" "MAC003246" "MAC003614" "MAC003627"
[51] "MAC003707" "MAC003777" "MAC003973" "MAC004047" "MAC004067"
[56] "MAC004292" "MAC004421" "MAC004672" "MAC004677" "MAC004723"
[61] "MAC004735" "MAC004756" "MAC004854" "MAC004931" "MAC004982"
[66] "MAC005033" "MAC005313" "MAC005556" "MAC005558" "MAC005559"
[71] "MAC005560" "MAC005563" "MAC005565"
hhold_stats %>%
filter(!lc_lid %in% hholds_10pc_zero) %>%
filter(median_kwh == 0)
36 households had a median of 0 –> indicating not-useful data; none of these remain once the >10% zero households are removed.
hholds_180obs <- hhold_stats %>%
filter(!lc_lid %in% hholds_10pc_zero) %>%
filter(num_values < 180) %>%
pull(lc_lid)
hholds_180obs
[1] "MAC001278" "MAC001300" "MAC001477" "MAC001653" "MAC001957"
[6] "MAC002155" "MAC003197" "MAC003202" "MAC003346" "MAC003347"
[11] "MAC003353" "MAC003481" "MAC003554" "MAC003559" "MAC003572"
[16] "MAC003592" "MAC003935" "MAC004679" "MAC005510"
19 more households have fewer than 180 days' worth of data; remove these too. Can't guarantee these are consecutive days, but 6 months' worth of data should give enough variability in weather. Imagining a model per household, 180 values would be enough, I think.
Cleaning households data:
2014-02-28 has a very low median value (at the end of the time series) - unusual, remove it.
daily_energy %>%
filter(date == "2014-02-28") # 4,987 obs, so most households
daily_energy %>%
filter(date == "2011-11-23") # start_date
Start date has only 13 households
Consider keeping only dates that have a minimum of 50 households?
num_hholds_by_date <- daily_energy %>%
group_by(date) %>%
summarise(num_hholds = n())
num_hholds_by_date
num_hholds_by_date %>%
ggplot() +
geom_line(aes(x = date, y = num_hholds))
Check again using the cleaned data, which removed households with little data:
num_hholds_by_date_clean <- daily_energy_clean %>%
group_by(date) %>%
summarise(num_hholds = n())
num_hholds_by_date_clean
num_hholds_by_date_clean %>%
ggplot() +
geom_line(aes(x = date, y = num_hholds))
What is the minimal number of households to take a median from? 200?
That would remove 19 dates (i.e. start from 2011-12-12)
num_hholds_by_date_clean %>%
filter(num_hholds < 200)
Or if start from 2011-12-01:
num_hholds_by_date_clean %>%
filter(date >= "2011-12-01") %>%
arrange(num_hholds)
Lowest value on a single day would be 91 households. That seems ok.
Start from 2011-12-01 - and also do this for weather_data.
Remember other processing I could add in:
- daily_energy_trim code chunk has steps to add in date values like weekday, season, etc - useful for the model too
- daily_energy_processed has computed all-time medians etc for comparison
- daily_energy_imputed has remaining zeros imputed using the mean kWh for that household's season and weekday type – if zeroes seem unreal
- daily_energy_ts is a tsibble version
- daily_energy_rolling_median has the 14-day moving average for plotting as a time series

daily_energy_clean <- daily_energy %>%
  filter(!lc_lid %in% hholds_10pc_zero) %>%
  filter(!lc_lid %in% hholds_180obs) %>%
  filter(date >= "2011-12-01" & date < "2014-02-28")
# start in Winter 2011 (91 households by that point)
# and remove the last date, which has a very low median value (unusual)
num_hholds_og <- length(unique(daily_energy$lc_lid))
num_hholds_clean <- length(unique(daily_energy_clean$lc_lid))
num_hholds_og
[1] 5566
num_hholds_clean
[1] 5474
num_hholds_og - num_hholds_clean
[1] 92
100* (num_hholds_og - num_hholds_clean) / num_hholds_og
[1] 1.652893
92 households (1.6%) removed from data, leaving 5,474 households in “cleaned” data.
total_days = as.numeric(max(daily_energy$date) - min(daily_energy$date) + 1)
hhold_stats <- daily_energy_clean %>%
  mutate(zero_kwh = if_else(kwh == 0, T, F)) %>%
  group_by(lc_lid) %>%
  summarise(num_values = n(),
            num_zeros = sum(zero_kwh),
            min_value = min(kwh),
            max_value = max(kwh),
            # max - min, not range(): range() returns two values,
            # which duplicated every summarise row
            range = max(kwh) - min(kwh),
            median_kwh = median(kwh),
            iqr = IQR(kwh),
            mean_kwh = mean(kwh),
            sd = sd(kwh)
  ) %>%
  mutate(pc_missing = (100 * (total_days - num_values) / total_days), .after = lc_lid) %>%
  mutate(pc_zeros = (100 * num_zeros / num_values), .after = pc_missing) %>%
  ungroup()
hhold_stats %>%
arrange(desc(pc_missing))
hhold_stats %>%
arrange(desc(pc_zeros))
hhold_stats %>%
ggplot() +
geom_point(aes(x = pc_missing, y = pc_zeros))
hhold_stats %>%
ggplot() +
geom_histogram(aes(x = pc_missing), bins = 100)
hhold_stats %>%
ggplot() +
geom_histogram(aes(x = pc_zeros), bins = 100)
# median by range - profile of variance
hhold_stats %>%
filter(median_kwh != 0) %>%
ggplot() +
geom_point(aes(x = median_kwh, y = iqr))
hhold_stats %>%
filter(median_kwh != 0) %>%
ggplot() +
geom_point(aes(x = max_value, y = median_kwh))
Doesn't look like there are groups here.
hhold_stats %>%
filter(median_kwh != 0) %>%
ggplot() +
geom_point(aes(x = max_value, y = min_value))
Does look like there may be correlations, and maybe some groups could be made here
todo could look at ratio of max/min as delta_change
# check for min of 0
hhold_stats %>%
filter(min_value == 0) # 454 rows
# check for max of 0
hhold_stats %>%
filter(max_value == 0)
# 0 households
Calculate change as (max - min) / max, i.e. range / max_value (valid because no household has a max of 0):
hhold_stats %>%
mutate(delta = range / max_value) %>%
ggplot() +
geom_point(aes(x = max_value, y = delta))
There is a shape here: households with a low max_value have higher delta (i.e. they are more reactive/changeable).
hhold_stats %>%
filter(min_value != 0)
No good - lots of households still have a min of 0, so their range/max will be exactly 1.
Try IQR / max as the measure of change instead:
hhold_stats %>%
mutate(delta = iqr / max_value) %>%
ggplot() +
geom_point(aes(x = max_value, y = delta))
Could be some groups here, beyond the main cluster: there are households with high max values and low delta, suggesting they remain high energy users regardless (i.e. not much change); others have a low max value and higher delta, indicating they are more reactive/responsive energy consumers.
hhold_stats %>%
mutate(delta = iqr / max_value) %>%
ggplot() +
geom_point(aes(x = min_value, y = delta))
Plotting v min_value is less informative, many households have low min_value. No real pattern in delta for households with higher min_value - maybe a positive correlation though?
Idea:
- split energy data for weekend v weekday before analysing more
Aug 23
We know season has an effect, see above time series
Looking at the weekly subseries above, we can see weekday usage is lower than weekend usage.
Split summer v winter, and then split weekday v weekend –> viz & stats
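A formal seasonal decomposition hasn't actually been run above; a sketch using STL from feasts on the daily-medians tsibble from earlier (the two season() periods, weekly and annual, are my guesses at sensible settings):

```r
library(feasts) # also attaches fabletools for model()/components()

# decompose the daily median series into trend + weekly + annual + remainder
median_all_daily_energy_ts %>%
  model(STL(median_kwh ~ season(period = 7) + season(period = 365))) %>%
  components() %>%
  autoplot()
```

The weekly seasonal component should make the weekday/weekend difference explicit, and the remainder panel should flag the odd first/last dates.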
Aug 23
Look at seasonality again
First need to convert data to tsibble:
library(tsibble)
library(slider)
# convert data to tsibble
daily_energy_clean_ts <- as_tsibble(daily_energy_clean, key = lc_lid, index = date)
index_var(daily_energy_ts)
[1] "date"
key_vars(daily_energy_ts)
[1] "lc_lid"
To make a rolling average for the whole dataset, first find the median of all values on each date (because values on a single date are not normally distributed but skewed, see above), then take the mean across a 14-day window (set as 7 before, 6 after, only shown for the complete range).
# make df with median for each day
median_14d_rolling_energy_clean_ts <- daily_energy_clean_ts %>%
index_by(date) %>%
summarise(median_kwh = median(kwh)) %>%
# add moving_avg using the mean
mutate(kwh_moving_avg = slide_dbl(
.x = median_kwh,
.f = ~ mean(.),
.before = 7,
.after = 6, # 14-day rolling average
.complete = TRUE
))
median_14d_rolling_energy_clean_ts %>%
ggplot() +
geom_line(aes(x = date, y = median_kwh), colour = "grey") +
geom_line(aes(x = date, y = kwh_moving_avg), colour = "red") +
theme_minimal()
Seasons still look like winter high, summer low. Also looks like energy usage is reducing year on year.
library(feasts)
Use the median values
median_14d_rolling_energy_clean_ts %>%
autoplot(median_kwh) +
xlab("Date") + ylab("Median daily energy consumption (kWh)")
# look at seasonal pattern
median_14d_rolling_energy_clean_ts %>%
gg_season(median_kwh)
Seasonal pattern is consistent year-on-year. Maybe 2014 looks like slightly lower usage? Also note some higher variability in winter 2011 data.
Need to remove the last date because values are off? And the first date too? Now handled in the cleaning step.
Aug 23
Look at subseries:
Can see trend for every weekday - does look like Sat & Sun are higher.
median_14d_rolling_energy_clean_ts %>%
gg_subseries(y = median_kwh, period = 7)
Yes, the blue lines for weekends still look higher than those for weekdays.
Aug 23
Split into winter and summer data only and repeat to see if there is seasonal interaction here
seasons_energy_clean_ts <- daily_energy_clean_ts %>%
mutate(weekday = wday(date, label = TRUE), .after = date) %>%
mutate(is_weekend = if_else(weekday %in% c("Sat","Sun"), T, F), .after = weekday) %>%
# note don't use zoo function on tsibble, use yearmonth()
mutate(yearmonth = yearmonth(date), .after = date) %>%
mutate(month = month(date, label = FALSE), .after = date) %>%
mutate(quarter = quarter(date, type = "date_first", fiscal_start = 12), .after = date) %>%
mutate(yearseason = case_when(
#quarter == "2011-09-01" ~ "Autumn 2011",
quarter == "2011-12-01" ~ "Winter 2011",
quarter == "2012-03-01" ~ "Spring 2012",
quarter == "2012-06-01" ~ "Summer 2012",
quarter == "2012-09-01" ~ "Autumn 2012",
quarter == "2012-12-01" ~ "Winter 2012",
quarter == "2013-03-01" ~ "Spring 2013",
quarter == "2013-06-01" ~ "Summer 2013",
quarter == "2013-09-01" ~ "Autumn 2013",
quarter == "2013-12-01" ~ "Winter 2013",
.default = NA_character_
), .after = quarter) %>%
mutate(yearseason = factor(yearseason, levels = c(#"Autumn 2011",
"Winter 2011", "Spring 2012", "Summer 2012",
"Autumn 2012", "Winter 2012", "Spring 2013",
"Summer 2013", "Autumn 2013", "Winter 2013")))
# make a summer subset (Summer 2012, 2013)
summer_energy_ts <- seasons_energy_clean_ts %>%
filter(str_detect(yearseason, "Summer"))
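The matching winter subset chunk isn't shown above but is needed as the input to the winter_rolling_ts chunk below; a sketch mirroring the summer one:

```r
# make a winter subset (Winter 2011, 2012, 2013)
winter_energy_ts <- seasons_energy_clean_ts %>%
  filter(str_detect(yearseason, "Winter"))
```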
# add medians & rolling avg for winter only
winter_rolling_ts <- winter_energy_ts %>%
index_by(date) %>%
summarise(median_kwh = median(kwh)) %>%
# add moving_avg using the mean
mutate(kwh_moving_avg = slide_dbl(
.x = median_kwh,
.f = ~ mean(.),
.before = 7,
.after = 6, # 14-day rolling average
.complete = TRUE
)) %>%
fill_gaps()
winter_rolling_ts %>%
ggplot() +
geom_line(aes(x = date, y = median_kwh), colour = "grey") +
geom_line(aes(x = date, y = kwh_moving_avg), colour = "red") +
theme_minimal()
winter_rolling_ts %>%
fill_gaps() %>%
gg_subseries(y = median_kwh, period = 7)
winter_rolling_wday_ts <- winter_rolling_ts %>%
mutate(weekday = wday(date, label = TRUE), .after = date) %>%
mutate(is_weekend = if_else(weekday %in% c("Sat","Sun"), T, F), .after = weekday)
winter_rolling_wday_ts
winter_rolling_wday_ts %>%
ggplot() +
geom_boxplot(aes(x = is_weekend, y = median_kwh))
Warning says: Removed 550 rows containing non-finite values
winter_rolling_wday_ts %>%
filter(is_weekend) %>%
ggplot() +
geom_histogram(aes(x = median_kwh), bins = 50)
Warning: Removed 157 rows containing non-finite values
Looks approximately normal (some missing values)
winter_rolling_wday_ts %>%
filter(!is_weekend) %>%
ggplot() +
geom_histogram(aes(x = median_kwh), bins = 50)
Warning: Removed 393 rows containing non-finite values
Not normal, has 1 row with very low median
winter_rolling_wday_ts %>%
arrange(median_kwh)
Warning: Current temporal ordering may yield unexpected results.
ℹ Suggest to sort by `date` first.
The low value is the last date of the trial - it looks odd in other graphs too, so remove it from the dataset. <– old, dealt with by removing 2014-02-28 in the cleaning step
TODO: deal with the warnings above - note they sum to 550!
TODO
Dataset 2, from https://www.kaggle.com/datasets/emmanuelfwerr/london-weather-data.
Aug 23
daily_weather <- read_csv("../3_raw_data/london_weather.csv")
Rows: 15341 Columns: 10
── Column specification ──────────────────────────────────────────
Delimiter: ","
dbl (10): date, cloud_cover, sunshine, global_radiation, max_t...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(daily_weather)
date is not a date value. Dates cover 1979 to December 2020; filter to the range of the energy data.
start_date <- min(daily_energy_clean$date)
start_date
[1] "2011-12-01"
end_date <- max(daily_energy_clean$date)
end_date
[1] "2014-02-27"
daily_weather_trim <- daily_weather %>%
mutate(date = ymd(date)) %>%
filter(date >= start_date & date <= end_date)
skimr::skim(daily_weather_trim)
── Data Summary ────────────────────────
Values
Name daily_weather_trim
Number of rows 820
Number of columns 10
_______________________
Column type frequency:
Date 1
numeric 9
________________________
Group variables None
Aug 23
# join energy data with weather data
daily_energy_weather <- left_join(x = daily_energy_clean, y = daily_weather_trim, by = join_by(date))
daily_energy_weather
Aug 23
Do households respond differently to weather? Can we identify households that might need to improve their heating (i.e. high energy usage on cold days) - both in terms of energy efficiency and home insulation?
First, check if mean_temp and kwh correlate (for the high/average/low consumer medians) - weather data loaded above.
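A sketch of that check, assuming daily_energy_weather from the join above and re-attaching the household_consumption_level labels computed earlier (the cleaned data has no level column):

```r
# median kWh per date and consumption level, with that date's mean temperature
level_temp_medians <- daily_energy_weather %>%
  left_join(select(household_alltime_consumption, lc_lid, household_consumption_level),
            by = "lc_lid") %>%
  group_by(date, household_consumption_level) %>%
  summarise(median_kwh = median(kwh),
            mean_temp = first(mean_temp),
            .groups = "drop")

# correlation per consumer level
level_temp_medians %>%
  summarise(corr = cor(median_kwh, mean_temp, use = "complete.obs"),
            .by = household_consumption_level)

level_temp_medians %>%
  ggplot(aes(x = mean_temp, y = median_kwh, colour = household_consumption_level)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)
```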
Remaining ideas:
Written scripts to prepare the data (including the joined data); moving to a new notebook now to explore the weather data and household energy/weather features.